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Abstract 

The Multiple Sequence Alignment (MSA) is a computational abstraction that 
represents a partial summary either of indel history, or of structural similarity. 
Taking the former view (indel history) , it is possible to use formal automata the- 
ory to generalize the phylogenetic likelihood framework for finite substitution 
models (Dayhoff 's probability matrices and Felsenstein's pruning algorithm) to 
arbitrary-length sequences. In this paper, we report results of a simulation- 
based benchmark of several methods for reconstruction of indel history. The 
methods tested include a relatively new algorithm for statistical marginaliza- 
tion of MSAs that sums over a stochastically-sampled ensemble of the most 
probable evolutionary histories. For mammalian evolutionary parameters on 
several different trees, the single most likely history sampled by our algorithm 
appears less biased than histories reconstructed by other MSA methods. The 
algorithm can also be used for alignment- free inference, where the MSA is ex- 
plicitly summed out of the analysis. As an illustration of our method, we discuss 
reconstruction of the evolutionary histories of human protein-coding genes. 
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1 Introduction 

The Multiple Sequence Alignment (MSA), indispensable to computational se- 
quence analysis, represents a hypothetical claim about the homology beteen 
sequences. MSAs have many different uses, but the underlying hypothesis can 
often be classified as a claim either of structural homology (the 3D structures 
align in a particular way) or of evolutionary homology (the sequences are related 
by a particular history on a given phylogenetic tree) . These types of hypothesis 
are similar, but with subtle (and important) distinctions: at the residue level, 
a claim of evolutionary homology (direct shared descent) is far stronger than a 
claim of structural homology (same approximate fold). Furthermore, both types 
of MSA — evolutionary and structural — typically only represent summaries of 
the respective homologies: some fine detail is often omitted. For example, an 
evolutionary MSA may — or may not — include the ancestral sequences at inter- 
nal nodes of the underlying tree. 

Structural and evolutionary MSAs are often conflated, but they have quite 
different applications. For example, a common use for a structural MSA is 
template-based structure prediction^ where a query sequence is aligned to a tar- 
get of known structure; the success of this prediction reflects the number of 
query-template residues correctly aligned 1 . By way of contrast, a common 
application for an evolutionary MSA is to identify regions or sites under selec- 
tion, the success of which depends on accurate reconstruction of the evolutionary 
history [2)|3J. 

Evaluation of alignment methods is typically done with implicit regard for 
the structural interpretation. Many benchmarks have used metrics based on the 
Sum of Pairs Score (SPS) [2]. In the situation that a query-template pairwise 
alignment is randomly picked out of the MSA, the SPS effectively estimates 
the proportion of homologous residues that are correctly identified. Several 
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alignment methods attempt to maximize the posterior expectation of SPS or 
similar metrics. This appears to improve accuracy, particularly when measured 
with reference to structural homology. However, it does not automatically confer 
evolutionary accuracy — a correct reconstruction of the evolutionary history of 
the sequences. 

Several studies suggest that multiple alignment for evolutionary purposes 
is still a highly uncertain procedure 5 , and that errors therein may signifi- 
cantly bias analyses of evolutionary effects 6|jll|. A useful component of these 
studies is simulation of genetic sequence evolution [6 , which appears to better 
indicate evolutionary accuracy than benchmarks derived from protein structure 
alignments. Simulations can be made quite realistic given the abundance of 
comparative sequence data fi2\ . 

The current state-of-the-art in phylogenetic alignment software is a choice 
between (on the one hand) programs that lack explicit models of the underlying 
evolutionary process, and so are not framed as statistical inference problems 6 , 
and (on the other hand) Bayesian Markov chain Monte Carlo (MCMC) methods, 
which are statistically exact but prohibitively slow 13[[l4]. 

A telling observation is that while substitution rate is routinely measured 
from MSAs and used as an indicator of natural selection, there is relatively little 
analogous use of indel rate. As we report here, it seems highly likely that even 
if indel rate is a useful evolutionary signal (which is eminently plausible), the 
present alignment methods distort measurements of this rate so far as to make 
it meaningless (see especially Figure [T]). 

In this paper, we frame phylogenetic sequence alignment as an approximate 
maximum likelihood (ML) inference. Our inference algorithm assumes that 
the tree is known, requiring a separate tree estimation protocol. While this 
is a strong assumption, it is in principle shared among all progressive aligners 



Insertion-deletion phylogenetics 



5 



(e.g. PRANK [15], Muscle , ClustalW [Tt], MAFFT \T^). The alignment- 
marginalized likelihoods reported by our algorithm allow for statistical tests 
between alternative trees, and the functionality to estimate an initial alignment 
and guide tree from unaligned sequences exists elsewhere in the DART package. 
Our framing uses automata-theoretic methods from computational linguistics 
to unify several previously-disjoint areas of bioinformatics: Felsenstein's prun- 
ing algorithm for the phylogenetic likelihood function 19 , progressive multiple 
sequence alignment 20 , and alignment ensemble representation using partial 
order graphs [21j. Our algorithm may be viewed as a stochastic generalization 
of pruning to infinite state spaces: it retains the linear time and memory com- 
plexity of pruning {0{NL) for N sequences of length L), while moderating the 
biasing effect of the MSA. The algorithmic details of our method are outlined 
briefiy in the Methods, and in more complete, mathematically precise terms 
(with a tutorial introduction) in a separately submitted work. 

Our software implementation of this algorithm is called ProtPal. We mea- 
sured the accuracy of ProtPal relative to leading non-MCMC alignment/reconstruction 
protocols by simulating indels and substitutions on a known phylogeny, with- 
holding the true history and attempting to reconstruct it from the sequences 
at the tips of the tree. The results show that all previous approaches to the 
reconstruction of ancestral sequences introduce significant biases, including sys- 
tematic underestimation of insertions and overestimation of deletions. This 
contradicts previous claims that advances in the statistical foundations of align- 
ment tools, supported by improvements in protein-structure benchmarks, nec- 
essarily improve the accuracy of evolutionary parameter estimates like the indel 



rate 6 22 , 23 



ProtPal introduces less bias than any other methods we tested, including 
PRANK, the state-of-the-art phylogenetic progressive aligner 6 . Based on our 
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tests, ProtPal appears to be the best choice for smaU to moderately-sized analy- 
ses, such as a reconstruction of the history of proteins at the inter-species level in 
human evolutionary history. Using ProtPal to estimate indel rates for ^ 7, 500 
human protein-coding gene families, we find that per- gene indel rates are ap- 
proximately gamma-distributed, with 95% of genes experiencing a mean rate 
of less than 0.1 indel events per synonymous substitution event. We find that 
lengths of inserted and deleted sequences are comparably distributed, having 
medians 5 and 7, respectively. The human lineage appears to have experienced 
unusually many insertions since the human-mouse split. By mapping genes to 
Gene Ontology (GO) terms, we find that the 200 fastest-indel genes are enriched 
for regulatory and metabolic functions. Possible applications and extensions of 
our algorithm include phylogenetic placement, homology detection, and recon- 
struction of structured RNA. 

2 Results 

2.1 Computational reconstruction of simulated histories 

We undertook to determine the ability of leading bioinformatics programs, in- 
cluding ProtPal, to characterize mutation event histories. We simulated indel 
histories on a tree, then attempted to reconstruct the MAP history, if, using 
only knowledge of the sequences S and the phylogeny T (but not the sequence 
alignment). The history H is the aligned set of observed extant and predicted 
ancestral sequences, such that insertion, deletion, and substitution events can 
be pinpointed to specific tree branches (though not to specific time points on 
those branches). 

We then characterized the reconstruction quality both directly, by compari- 
son of H to the true and indirectly, by using H to estimate ^, the evolutionary 
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parameters: 

Ofy = argmax^,P((9'|^, 5', T) = argmax^,P(^, S\T, 0') (1) 

where the latter step assumes a fiat prior, P{0') = const. We then compared 
the history-conditioned parameter estimate to the true 6. 

This statistic is not without its problems. For one thing, we use an initial 
guess of 6 to estimate H. Furthermore, for an unbiased estimate, we should sum 
over all histories, rather than conditioning on the MAP reconstructed history. 
This summing over histories would, however, require multiple expensive calcula- 
tions of P(6'|T, 6>), where conditioning on H requires only one such calculation. 
Furthermore, parameter estimation conditioned on a MAP-reconstructed his- 
tory is the de facto method employed by large-scale genomics studies focusing 



on indels 24 - 27 . 



2.1.1 Simulation model parameters 

The model parameters are 6 = (A% A^, p*, p^, R): the insertion and deletion 
rates (A*,A^), indel length distributions (p%p^) and substitution rate matrix 
(R). Here we focus on the rates (A*,A^). 

As described in Appendix [A] we generated data using an external simulation 
tool, indel-seq-gen, varying insertion (A*), deletion (A^) and substitution rates 
(r) over a range representative of per-gene rates in Amniota evolution (Figure[4|. 
We varied indel rates (with A* = A'^) between 0.005 and 0.08 expected indels per 
unit time, estimating that this range accounts for 95% of human gene families. 
We left the substitution model (R) and indel length distributions (p*,p'^) fixed, 
employing indel-seq-gen's empirically-estimated values. 

We performed simulations on mammalian, amniote and fruitfiy phylogenies, 
using the taxa in those clades for which genomic sequence is actually available. 
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We found generally consistent results, with common trends that were most 
pronounced on the largest of the three trees that we used (the twelve sequenced 
Drosophila species 28 ). In discussing the trends, we will refer specifically to 
the results on this largest of the trees. 

2.1.2 Indel rate estimates 

Overall most accurate We first set out to determine which program, when 
used to analyze a set of unaligned sequences, returns the indel rate estimate 
closest to the true rate. 

We report the ratio of inferred rate to true rate for insertions ^ and dele- 
tions ^ in Figure [ij with each G {^^^^^l defined as in Equation j?! 
No parameter estimate derived from a computationally reconstructed history 
approaches the level of accuracy achieved using the true history (labeled "True 
simulated history" in Figure [T]). 

The results do not always concord with previous benchmarks that have mea- 
sured accuracy using 3D structural alignments: for example, the FSA program, 
one of the most accurate aligners on structural benchmarks 23 , performs poorly 
here. This discordance may be due to the fundamental differences between evo- 
lutionary and structural homology, and the metrics used to assess each. For 
instance, consider a region with many nearby and overlapping insertions and 
deletions. The spatial and temporal location of these insertion and deletion 
events (in particular, the pinpointing of events to branches on the tree) defines 
what the "perfect" evolutionary reconstruction is. In contrast, even given per- 
fect knowledge of the insertion/deletion history, a "perfect" structural alignment 
depends only on the proteins at the tips of the tree, and this alignment could 
differ from the true evolutionary reconstruction. 

Fundamentally, the difference between FSA and ProtPal is the underlying 
metric that is being optimized by each program: FSA attempts to maximize 
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a metric (AMA= Alignment Metric Accuracy) which is essentiahy "structural" 
(in the sense that it predicts how many residues would be correctly aligned in 
a pairwise alignment of two leaf-node sequences, as might be used in structure 
prediction by target-template alignment), while ProtPal attempts to maximize 
a "phylogenetic" metric (the probability of a given evolutionary history). The 
metric we have used in our benchmark (which counts correct reconstruction of 
the number of indel events on branches of the tree) is also "phylogenetic" . By 
contrast. Appendix [B] Figure [7| shows the programs' ranking using the AM A 
metric. FSA perfoms well, with accuracy exceeding that of ProtPal in the 
highest indel rate category. This suggests that the differences between our 
evolutionary benchmark and previous benchmarks are not due to the data, 
but rather the types of metrics that are used to measure alignment accuracy; 
similarly, the differences between the leading programs are primarily due to 
what types of benchmark they are explicitly trying to perform well at. 

All programs other than ProtPal display insertion- t'er^ii^-deletion biases that 
are, to a varying degree, asymmetric. Typically, the asymmetry is that inser- 
tions are underrepresented and deletions overrepresented. ProtPal's bias, which 
is generally less than the other programs, is also the most symmetric: recon- 
structed insertions and deletions are roughly equally reliable, with both slightly 
underestimated. 

Over the distribution of human gene rates used by this benchmark, our 
phylogenetic likelihood approach, ProtPal, provides the most accurate recon- 
structions of both insertion and deletion counts. PRANK, which also uses a 
tree (but no likelihood), avoids insertion-deletion biases to a certain extent, 
although insertion rates are slightly underestimated relative to deletion rates. 
Since ProtPal's MAP history estimation appears similar to the optimization al- 
gorithm of PRANK, we suspect that ProtPal's marginally better performance 
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is due primarily to its main difference in implementation: ProtPal tracks an 
ensemble of possible reconstructions during progressive tree traversal (Section 
|4|, whereas PRANK uses a single "current best guess." 

Effect of indel rate variation To investigate the effect of indel rate variation 
on estimation accuracy, we separate each program's error distributions by indel 
rate (Figure |2|. We find that all programs' accuracy is strongly affected by the 
indel rate used in simulation. 

As the true indel rate increases, most programs' estimates drift towards 
^ ^ 0. This is consistent with the so-called "gap attraction" effect, where 
indels that are nearby in sequence can be misinterpreted as substitution events 



29 . Depending on the phylogenetic orientation of the events, estimated rates 
can be elevated or lowered, with different biases for insertion and deletion rates 
(Figure [3|. 

Gap attraction and other biases operate simultaneously, and are sometimes 
opposed. MUSCLE over-estimates the deletion rate under most conditions, 
but (consistent with a trend where programs have lower ^ at higher indel 
rates) gets the deletion rate roughly correct in the highest-indel-rate category 
of our benchmark. However, the alignments produced by MUSCLE at high 
indel rates are no more "accurate" by pairwise metrics (Appendix [B| Figure ff\ . 
We conjecture that multiple, contradictory types of gap attraction are at work, 
e.g. Figures and[3p. 

After ProtPal, the two most accurate reconstruction methods are PRANK 
and ProbCons (the latter combined with a parsimonious indel reconstruction). 
ProbCons produces more reliable insertion estimates than PRANK in a broad 
range of benchmark categories, is tied with PRANK for deletion estimates, and 
appears robust to indel rate variation. PRANK performs slightly better than 
ProbCons in the slowest indel rate category we considered. ProtPal produces the 
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most reliable estimates overall, outperforming ProbCons in all but the fastest 
indel rate category, and PRANK in all but the slowest. 

Sensitivity to substitution rate As compared to variation of simulated in- 
del rate, variation of simulated substitution rate appears to have little effect 
on the accuracy of indel reconstruction (Appendix [B] Figure [8| . One notable 
exception is FSA, which appears to be affected by the substitution rate more 
than the other programs. For example, when the simulated indel and substitu- 
tion rates are both low, FSA is comparable to the most accurate of the other 
programs (ProtPal); but when the substitution rate is increased, FSA's error is 
greater than the least accurate program (CLUSTALW). Errors in estimating the 
substitution rate are comparable among the programs tested, and are similarly 
correlated with the simulation indel rate (Appendix [b| Figure [9|) . 

2.2 Reconstructed indel histories of human genes 

We present here a comprehensive set of reconstructions accounting for the evo- 
lutionary history of individual codons in human genes. We used genes in the 
Orthologous and Paralogous Transcripts in Clades (OPTIC) database's Am- 
niota set, comprised of the 5 mammals H. sapiens^ M. musculus, C. famil- 



iarise M. domestica, O. anatinus and G. gallus as an outgroup 30 . Con- 
sidering only those families with one unique ortholog per species (approxi- 
mately 7,500 families), we combined tree branch statistics across genes, us- 



ing the species tree in Appendix [C] Figure [TT] Our reconstructions are available 
at http : //biowiki . or g/'-oscar /optic_reconstruction . tarl and we provide 
here various graphical summaries of Amniota evolutionary history. Several neg- 
ative results stand in contrast to earlier-reported trends. 



Insertion-deletion phylogenetics 



12 



Indel rates Insertion and deletion rates are approximately gamma-distributed 
(Figure [4|. Roughly 95% of genes have indel rates < 0.1 indels per synonymous 
substitution. 



Phylogenetic origins In our simulations, ProtPal pinpoints residues' "branch 
of origin" more reliably than other tools, with a 93% accuracy rate (Appendix [B] 
Figure [6|. Many codons appeared to have been inserted following the human- 
mouse split (Appendix [C| Figure 10) 



Branch-specific indel rates Using our reconstructions to estimate the rates 
of indel mutations along specific tree branches, we find evidence of an elevated 
insertion rate in the human (black) branch, as well as on the the Amniota - 
Australophenids (pink) branch (Appendix [C| Figure 10). 



Amino acid distributions Distributions over amino acids differ significantly 
between inserted, deleted and non-indel sequences (Appendix [C]Fi gure 12L In 



general, small residues are over-represented in insertions, in agreement with 
previous studies [31] . 

Indel lengths We find, contrary to a previous study in Nematode 32 , that 
length distributions in the Amniotes are nearly identical between insertions 
and deletions (Appendix [C| Figure 13). The previously-reported result may 



be attributable to the deletion-biased nature of the methods used, particularly 
CLUSTALW and MUSCLE 32 . 



Indel position The position of indels within genes is highly biased towards 
the ends of genes, presumably in large part refiecting annotation error (Ap- 
pendix [C] Figure 14). The bias is strongest for deletions at the N-terminus of 
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the gene, but both insertions and deletions are enriched in both C- and N- 
termini. 

Evolutionary context of indel SNPs We find no general correlation be- 
tween the indel rate for a gene and the number of indel polymorphisms recorded 
for that gene in dbSNP [33 (Appendix [C]Fi gure 15^. 



Gene ontology indel rates No Gene Ontology (GO) categories stand out as 
having significantly lowered or heightened indel rates in any of the three ontolo- 
gies, contrasting with the reported results of a 2007 study using a smaller num- 
ber of genes [sT). An enrichment analysis conducted with GOstat [34] showed 
that the 200 fastest evolving genes in our data are significantly enriched for 
regulatory and metabolic functions. 



3 Discussion 

We developed and analyzed a simulation benchmark that compares programs 
based on their reconstructions of evolutionary history, using instantaneous mu- 
tation rates representative of Amniote evolution. We tested several different tree 
topologies; results were similar on all trees, but most pronounced on the tree 
with the longest branch lengths. We find that most programs distort indel rate 
measurements, despite claims to the contrary. Moreover, the systematic bias 
varies significantly when the rates of substitutions and indels are varied within 
a biologically reasonable range. Many of the programs we rated have been 
ranked in the past, but using benchmarks that use protein structural align- 
ments as a gold standard, rather than evolutionary simulations. Furthermore, 
these previous benchmarks have not directly assessed the reconstruction of evo- 
lutionary history (or summary statistics such as the indel rate), but have used 
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other alignment accuracy metrics such as the Sum of Pairs Score. Ahgnment 
programs that perform weakly on our benchmark have apparently performed 
well on these previous benchmarks. We hypothesize that these benchmarks, 
compared to ours, are less directly predictive of a program's accuracy at histor- 
ical reconstruction, although they may better reflect the program's suitability 
to assist in tasks relating more closely to folded structure, like prediction of a 
protein's 3D structure from a homologous template. We have introduced a new 
notation that describes a general, hidden Markov model-structured likelihood 
function for indel histories on a tree, as well as the structure of the corresponding 
inference algorithm. We have implemented the new method in a freely- available 
program, ProtPal, that allows, for the first time, phylogenetic reconstruction 
with accuracy over a broad range of indel rates. ProtPal is written in C++ as a 
part of the DART package: www . biowiki . org/ProtPall The evolutionary re- 
constructions ProtPal produces are, according to our simulated tests, the most 
accurate of any available tool, for a range of parameters typical of human genes. 

We applied ProtPal to the reconstruction of human gene indel history, us- 
ing families of human gene orthologs from the OPTIC database. We find some 
patterns that agree with previous studies, such as the non-uniform distribu- 
tions over amino acids seen in [31] . Other results stand in contrast - a previous 
study found significantly different length distributions for insertions and dele- 



tions 32 , whereas in our data they appear very similar. Another prediction of 
our reconstruction is an elevated rate of insertions on the human branch since 
the human- mouse split. This contrasts with a previous analysis though the 
data therein was whole genomes, rather than individual protein-coding genes. 
In contrast to 31 , we find no obvious predictive power of the Gene Ontology 
(GO) for indel rates; that is, the indel rate does not appear strongly correlated 
with the presence or absence of any particular GO term-gene association. How- 
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ever, enrichment analysis for GO terms using GOstat [34] showed that the 200 
fastest-evolving genes are significantly enriched for regulatory and metabolic 
function. This apparent discrepancy might be explained by a group of regu- 
latory and metabolic genes which have very high indel rates, but whose small 
number prevent them from skewing the average within their GO categories. 

Many applications which use a fixed- alignment phylogenetic likelihood could 
potentially benefit from ProtPal's reconstruction profiles. For example, phyloge- 
netic placement algorithms estimate taxonomic distributions by evaluating the 
relative likelihoods of placing sequence reads on tree branches 36 . By using se- 
quence profiles exported from ProtPal, these reads could be placed with greater 
attention to indels and a more realistic accounting for alignment uncertainty. 
Homology detection could be done in a similar way, thereby making use of the 
phylogenetic relationship of the sequences within the reference family. It has 
been observed that the detection of positive selection is highly sensitive to the 
alignment used 1 7 . ProtPal could be modified to detect selection using entire 
profiles rather than single alignments, potentially eliminating the bias brought 
on by an inaccurate alignment. 

In summary, multiple alignments are frequently constructed for use in down- 
stream evolutionary analyses. However, except for our method and slow-performing 
MCMC methods, there are no software tools for reconstructing molecular evo- 
lutionary history that explicitly maximize a phylogenetic likelihood for indels. 
Our results strongly indicate that algorithms such as ProtPal (which use such a 
phylogenetic model) produce significantly more reliable estimates of evolution- 
ary parameters, which we believe to be highly indicative of evolutionary accu- 
racy. These results falsify previous assertions that existing, non-phylogenetic 
tools are well-suited to this purpose. Furthermore, we have demonstrated that 
it is possible to achieve such accuracy without sacrificing asymptotic guarantees 
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on time/memory complexity, or resorting to expensive MCMC methods. Prot- 
Pal can reconstruct phylogenetic histories of entire databases on commodity 
hardware, enabhng the large-scale study of evolutionary history in a consistent 
phylogenetic framework. 



4 Methods 

The details concerning generation and analysis of simulated data are contained 
in Appendix [Aj A mathematically complete description of the alignment algo- 
rithm has been submitted as a separate work, and an early version has been 



made available online here: http://arx iv.org/abs/1103.4347i 

4.1 Felsenstein's algorithm for indel models 

Our algorithm may be viewed as a generalization of Felsenstein's pruning re- 
cursion 19 , a widely-used algorithm in bioinformatics and molecular evolution. 
A few common applications of this algorithm include estimation of substitu- 
tion rates |37 ; reconstruction of phylogenetic trees 38 ; identification of con- 
served (slow-evolving) or recently-adapted (fast-evolving) elements in proteins 
and DNA 39 ; detection of different substitution matrix "signatures" (e.g. puri- 
fying vs diversifying selection at synonymous codon positions 40 , hydrophobic 



vs hydrophilic amino acid signatures 41 , CpG methylation in genomes 42 



or basepair covariation in RNA structures 43 ); annotation of structures in 
genomes |44|45l ; and placement of metagenomic reads on phylogenetic trees 36 . 

Felsenstein's algorithm computes P(6'|T, 6) for a substitution model by tab- 
ulating intermediate probability functions of the form Gn{x) = P{Sn\xn = x, ^), 
where Xn represents the individual residue state of ancestral node n, and Sn rep- 
resents all the sequence data that is causally descended from node n in the tree 
(i.e. the observed residues at the set of leaf nodes whose most recent common 
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ancestor is node n). 

The pruning recursion visits all nodes in postorder. Each Gn function is 
computed in terms of the functions Gi and Gr of its immediate left and right 
children (assuming a binary tree): 

Gn{x) = P{Sn\Xn=X,0) 

(Ecc, M^'^a:,Gi{xi)^ (e,, M^%^Gr{xr)) if Ti is uot a leaf 
d{x = Sn) if n is a leaf 

where M^^^ = P{xn = b\xm = o) is the probability that node n has state 6, given 
that its parent node m has state a; and 5{x = Sn) is a Kronecker delta function 
terminating the recursion at the leaf nodes of the tree. These Gn functions are 
often referred to as "messages" in the machine- learning literature 46 . 

Our new algorithm is algebraically equivalent to Felsenstein's algorithm, if 
the concept of a "substitution matrix" over a particular alphabet is extended to 
the countably-infinite set of all sequences over that alphabet. Our chosen class 
of "infinite substitution matrix" is one that has a finite representation: namely, 
the finite-state transducer^ a probabilistic automaton that transforms an input 
sequence to an output sequence, and a familiar tool of statistical linguistics [471. 

By generalizing the idea of matrix multiplication {AB) to two transducers 
{A and 5), and introducing a notation for feeding the same input sequence to 
two transducers in parallel {AoB)^ we are able to write Felsenstein's algorithm 
in a new form (see Section |4.3[) : 



Gfi 



{M^^^Gi) o [M^'^^Gr) if n is not a leaf 
V{Sn) if n is a leaf 



where V{Sn) is the transducer equivalent of the Kronecker delta 5{x = Sn)- 
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The function Gn is now encapsulated by a transducer "profile" of node n. 

This representation has complexity 0{L^) for N sequences of length L, 
which we reduce to 0{LN) by stochastic approximation of the Gn- This ap- 
proximation relies on the alignment envelope |48], a data structure introduced 
by prior work on efficient alignment methods. The alignment envelope is a 
subset of all the possible histories in which most of the probability mass is con- 
centrated. A related data structure is the partial order graph ^21j. Both these 
data structures can be viewed as ensembles of possible histories, in contrast to a 
single "best-guess" reconstruction of the history. Figure [5] shows a state graph, 
with paths through it corresponding to histories relating the two sequences GL 
and GIV. The paths highlighted in blue form a partial order graph, correspond- 
ing to a subset of these histories generated by a stochastic traceback. At each 
progressive traversal step, we sample a high-probability subset of alignments of 
two sibling profiles in order to maintain a bound on the state space size. Note 
that if we sample only the most likely path at every internal node, we essen- 
tially recover the progressive algorithm of PRANK, and if we sample and store 
all solutions, we recover the machine Gn with state space of size 0{L^). 

4.2 Transducer definitions and lemmas 

The definitions and lemmas are presented in a condensed form here, and ex- 
panded upon in [49]. 

A transducer is a tuple (1], ^, <l>, ^5, r, W) where Q is an input alphabet, 
^ is an output alphabet, <l> is a set of states, ^5- G ^ is the start state, (j)E ^ ^ 
is the end state, r C ^ x {QU {e}) x U {e}) x $ is the transition relation, 
and W : r ^ [0, 00) is the transition weight function. 

Suppose that T = {Q, ^, <l>, 0^, r, W) and U = (^]^ ^^ <l>^ 0'^, f^, r^ W) 
are transducers. 
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Let >V(7r) be the product of all transition weights along a state path tt and 
let yV{x : [T] : y) be the sum of such weights for all paths whose input labels, 
concatenated, yield the string x G and whose output labels yield y G 

Equivalence: If T and U have the same input and output alphabets {Q = Q' 
and ^ = and the same sequence weights yV{x : [T] : y) = yV'{x : [U] : 
y) Vx, then we say the transducers are equivalent^ T = U. Less formally, we 
win write T^U if W{x : [T] :y)c^W\x: [U] : y). 

Moore transducers: The Moore normal form for transducers, named for 
Moore machines |50J, associates input/output with three distinct types of state: 
Match^ Insert and Delete. Paths through Moore transducers can be associated 
with (gapped) pairwise alignments of input and output sequences. For any 
transducer T, there exists an equivalent Moore-normal form transducer U with 
|$'|=0(|r|)and |r'| = 0(|r|). 

Composition: If T's output alphabet is the same as /7's input alphabet 

= n^), there exists a transducer, TU = {Q, W), that unifies the 

output of T with the input of U, such that Vx G 1^*, z G (^')*: 

W\x : [TU] •z)= W{x : [T] : y)W\y : [U] : z) (2) 

If T and U are in Moore form, then \^''\ < \^\ x |$'| and |r''| < |r| x |r'|. 

Intersection: If T and have the same input alphabets = 1]'), there 
exists a transducer, ToU = {Q, W"), that unifies the input of T with 

the input of U. The output alphabet is = U {e}) x U {e}), i.e. a 
T-output symbol (or a gap) aligned with a /7-output symbol (or a gap). 

Let alignments{t^u) C (^'')* denote the set of all gapped pairwise align- 
ments of sequences t G ^* and u G (^')*. Transducer T o U has the property 
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that Vx G G G (^0*- 

W'\x :[ToU]:v) = W{x : [T] : t)W\x : [/7] : ix) (3) 

vGalignments{t,u) 

If T and U are in Moore form, then |<l>''| < |^| x |^'| and |r''| < |r| x |r'|. Paths 
through T oU are associated with three-way ahgnments of the input sequence 
to the two output sequences. 

Identity: Let X be a transducer that copies input to output unmodified, so 
XT = TX = T. 

Exact match: For any sequence 6* G 1^*, there exists a Moore- form transducer 
V{S) = ((^,0,<l>,r...) with \^\ = (9(length(6')) and |r| = C>(length(6')), that 
rejects ah input except 5, such that yV{x : [V(5')] : e) = 1 if x = 5, and if 
X S. Note that V(6') outputs nothing (the empty string). 

Chapman- Kolmogorov transducers: A transducer T is probabilistic if yV{x : 
[T] : y) represents a probabihty P{y\x^T): that is, for any given input string, 
X, it defines a probabihty measure on output strings, y. 

Suppose T{t) is a function returning a probabiUstic transducer of the form 
(r^, <l>, <ps^ 4^E-, T-, i.e. a transducer whose transition weight W depends 

on an additional time parameter^ t, and which satisfies the transducer equiva- 
lence T{t)T{t') = T(t + t') \Jt, t'. 

Then T{t) gives the finite-time transition probabilities of a homogeneous 
continuous-time Markov process on the strings ^7*, as the above transducer 
equivalence is a form of the Chapman-Kolmogorov equation. 

If the state space of T is finite, then this equation describes a renormalization 
of the composed state space ^ x $ back down to the original state space <l>. So 
far, only one nontrivial time-dependent transducer is known that solves this 



equation exactly using a finite number of states: the TKF91 model 51 . 
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4.3 The phylogenetic likelihood 

We rewrite the evidence, P{S\T,6) for sequences tree T, and parameters 
0, in the form P{{Sn ' n G C}\R^{Bn}) where {Sn : n e C} denotes the set 
of sequences observed at leaf nodes, {Bn} denotes the stochastic evolutionary 
processes occuring on the branches, and R denotes the probabilistic model for 
the sequence at the root node of the tree. 

The root and branch transducers (i?, {Bn}) represent an alternative view 
of the tree and parameters (T, 0) . The root transducer R outputs from the 
equilibrium or other initial distribution of the process. If (p, c) G T is a parent- 
child pair, then Be = B{Tpc) is a time-dependent transducer parameterized 
by the branch length. In practise, the branch transducers need not satisfy 
the Chapman-Kolmogorov equation for the following constructs to be of use; 
for example, the {Bn} might be approximations to true Chapman-Kolmogorov 
transducers |52) . 

Let R = (0, 1], . . .) be a transducer outputting sequences sampled from the 
prior at the phylogenetic root. 

Let n be a tree node. If n is a leaf, define Fn = T. Otherwise, let (/,r) 
denote the left and right child nodes, and define Fn = {BiFi) o [BrFr) where 
Bn = (1^, 1^, . . .) is a transducer modeling the evolution on the branch leading 
to n. 

Diagramatically we can write Fn as 




Bi Br 

Fl Fr 



The phylogenetic likelihood is then fully described by F = RFj^qq^. 
Like i?, transducer F models a probability distribution over output se- 
quences, but accepts only the empty string as an input sequence. This empty 
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input sequence is just a technical formality (transducers must have inputs); if 
we ignore it, we can think of F and R as hidden Markov models (HMMs), 
rather than transducers. R is an HMM that generates a single sequence, F a 
multi-sequence HMM that generates the whole set of leaf sequences. 

Inference with HMMs often uses a dynamic programming matrix (e.g. the 
Forward matrix) to track the ways that a given evidential sequence can be 
produced by a given grammar. 

For our purposes it is useful to introduce the evidence in a different way, by 
transforming the model to incorporate the evidence directly. We augment the 
state space so that the model is no longer capable of generating any sequences 
except the observed {Sn}, by composing i^root'^ forked outputs with exact- 
match transducers that will only accept the observed sequences at the leaves of 
the tree. This yields a model, G, whose state space is of size 0{L^) and, in 
fact, is directly analogous to the Forward matrix. 

If n is a leaf node, then let Gn = V(5'n) where Sn is the sequence at n. 
Otherwise, Gn = {BiGi) o (BrGr)- 

Diagramatically we can write Gn as 




Bi Br 
Gi Gr 

Let G = i^Groot- The evidence is P{{Sn}\R, {Bn}) = W(e : [G] : e). 

The net output of G is always the empty string. The sequences {Sn} are 
recognized as inputs by the V{Sn) transducers at the tips of the tree, but are 
not passed on as outputs themselves. 

Likewise, the input of G is the empty string, because R accepts only the 
empty string on its input. 

We can think of G as a Markov model, rather than an HMM. It has no input 
or output; rather, the sequences are encoded into its structure. 
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Transducer G has 0{L^) states, which is impractically many, so ProtPal 
uses a progressive hierarchy Hn of approximations to the corresponding G^, 
with state spaces that are bounded in size. 

If n is a leaf node, let Hn = V(5'n) = Gn- Otherwise, let Hn = (BiEi) o 
(BrEr) where ^ is a subset defined by sampling complete paths 

through the Markov model = RH^ and adding the i^^-states used by 
those paths to ^e^^ until the pre-specified bound on {^e^I is reached. Then 
G - Mroot- 

The likelihood of a given history may be calculated by summing over paths 
through G consistent with that history. In the simplest cases (e.g. minimal 
Moore- form branch transducers), each indel history corresponds to exactly one 
path, so the MAP indel history corresponds to the maximum-weight state path 
through G. 

4.4 Alignment envelopes 

Let V(5') be defined such that it has only one nonzero- weighted path 

Xo^Wo^ Mi^Wi^ ...^ Wl-1 ^Ml^Wl^Xl 

so a V(5')-state is either the start state (Xq), the end state (Xl)^ a wait state 
(Wi) or a match state {Mi). All these states have the form 0^ where i represents 
the number of symbols of S that have to be read in order to reach that state, 
i.e. a "co-ordinate" into S. All V(5') -states are labeled with such co-ordinates, 
as are the states of any transducer that is a composition involving V(S'), such 
as Gn or Hn- 

For example, in a simple case involving a root node (1) with two children 
(2,3) whose sequences are constrained to be S2,Ss, the evidence transducer is 
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G = RG^oot = R{G2 o Gs) = R{{B2\/{S2)) o {Bs\/{Ss))) 



V[^2] V[^3] 

All states of G have the form g = (r, 62, ^2^2, ^3, ^3^3) where 02, 03 ^ 
{X, VK, M}, so 02^2 ^ {-^225^*25^*2} similarly for 03^3. Thus, each state 
in G is associated with a co-ordinate pair (^2,^3) into (5'2,5'3), as well as a 
state-type pair (02,03)- 

Let n be a node in the tree, let be the set of indices of leaf nodes descended 
from n, and let Gn be the phylogenetic transducer for the subtree rooted at n. 



defined in Section [43] Let be the state space of Gn- 

If m G >Cn is a leaf node descended from n, then Gn includes, as a component, 
the transducer V(5'm)- Any G^-state, ^ G <l>n, is a tuple, one element of which is 
a V(5'm)-state, 0^, where Hs a co-ordinate (into sequence Sm) and is a state- 
type. Define imig) to be the co-ordinate and 0m (^) to be the corresponding 
state-type. 

Let An : 2^^^ be the function returning the set of absorbing leaf indices 

for a state, such that the existence of a finite-weight transition g' ^ g implies 
that ini{g) = im{g') + 1 for ah m e An{g). 

Let (/,r) be two sibling nodes. The alignment envelope is the set of sibling 
state-pairs from Gi and Gr that can be aligned. The function E : x ^ 
{0, 1} indicates membership of the envelope. For example, this basic envelope 
allows only sibling co-ordinates separated by a distance s or less 



^basic(/'^) = , , \^m{f) - in{g)\ < s (4) 

An alignment envelope can be based on a guide alignment. For leaf nodes x, y 
and 1 < i < \ength{Sx)^ let Q{x^i^y) be the number of residues of sequence Sy 
in the section of the guide alignment from the first column, up to and including 
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the column containing residue i of sequence Sx- 

This envelope excludes a pair of sibling states if they include a homology 
between residues which is more than s from the homology of those characters 
contained in the guide alignment: 

^suide(/'^) = . max( \g{m,im{f),n)-in{g)\ , \g{n,in{g),m)-im{f)\ ) < s 

(5) 

Let K{x^ z, j) be the number of match columns (those columns of the guide 
alignment in which both Sx and 5*^ have a non-gap character) between the 
column containing residue i of sequence Sx and the column containing residue 
j of sequence Sy. This envelope excludes a pair of sibling states if they include 
a homology between residues which is more than s matches from the homology 
of those characters contained in the guide alignment: 



^suide(/'^) = . max( |^(m, n) - K{m,im{f),n,in{g))\, 

^ meAi{f),neAr-ig) 

\g{n,in{g),m) - K{n,in{g),m,im{f))\ ) < s 
4.5 OPTIC data analysis 

Data Amniote gene families were downloaded from Ih ttp : //genserv . anatT 
ox . ac . uk/downloads/clades/[ l We restricted our analysis to the ^ 7,500 fam- 
ilies having simple 1:1 orthologies. The same species tree topology (downloaded 
from |http : //genserv . anat . ox . ac . uk/clades/amniota/displayPhylogeny, was 

used for all reconstructions, though branch lengths were estimated separately 
for each family as part of OPTIC. When computing branch-specific indel rates, 
the branch lengths of the species tree were used. 
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Reconstruction and rate estimation Gene families were aligned and re- 
constructed using ProtPal with a 3-rate class Markov chain over amino acids, 
insertion and deletion rates set to 0.01, and 250 traceback samples. Averaged 
and per-branch indel rates were computed with ProtPal using the -pi and -pb 
options. The indel rates were then normalized by the synonymous substitution 
rate for each corresponding nucleotide alignment (taken directly from OPTIC), 
computed with PAML 53 . Residues' origins were determined by finding the 
tree node closest to the root containing a non-gap reconstructed character. 

External data Genes were mapped to Gene Ontology terms via the map- 
ping downloaded from http : //www . ebi . ac . uk/GOA/human_release . htmlfdur- 
ing 10/2010. Indel SNPs per gene were taken from a table downloaded from 
Supplemental Table 5 of [54] . 





5 Figures 
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Figure 1: ProtPal's estimates of insertion and deletion rates are the most 

accurate of any program tested, as measured by the RMSE of ^ values aggre- 
gated over all substitution/indel rate categories. Quantiles containing 90% of 
the data are shown as a bolded portion of the x-axis, and RMSE is shown to 
the right of each distribution, the latter computed as described in Appendix [A| 
Equation 6. No aligner approaches the accuracy of the rates estimated with 
the true alignment, though ProtPal, PRANK, and ProbCons are the top three, 
with ProtPal as the most accurate over all. Many aligners, particularly MUS- 
CLE, CLUSTALW, and MAFFT, significantly underestimate insertion rates 
and overestimate deletion rates. ProtPal and PRANK perform their own an- 
cestral reconstruction and other alignment programs were augmented with a 
most-recent-common-ancestor (MRCA) parsimony as described in [55j . 
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Figure 2: Rate estimation accuracy is highly dependent on the simulated indel 
rate. For instance, PRANK is more accurate at lower indel rates, ProbCons is 
more accurate at higher rates. ProtPal is more accurate than PRANK in all but 
one rate (0.005) and equal or more accurate than ProbCons in all but one rate 
(0.08). The drift towards '''{^7'^ = exhibited by most programs indicates 
that most programs infer proportionally fewer indels as rates are increased, 
likely due to various forms of gap attraction. Color-coded 90% quantiles and 
RMSEs are shown underneath and to the right of each group of distributions, 
respectively. RMSE is computed as described in Appendix [A| Equation 6 
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A Simulation parameters and setup 

Data generation Our simulation study is comprised of alignments simulated 
using 5 different indel rates (0.005, 0.01, 0.02, 0.04, and 0.08 indels per unit 
time), each with 3 different substitution rates (0.5, 1, and 2 expected substitu- 
tions per unit time) and 100 replicates. Time is defined such that a sequence 
evolving for time t with substitution rate r is expected to accumulate rt subsi- 
tutions per site. We employed an independent third-party simulation program, 
indel-seq-gen, specifically designed to generate realistic protein evolutionary his- 
tories 12 . indel-seq-gen is capable of modeling an empirically- fitted indel length 
distribution, rate variation among sites, and a neighbor-aware distribution over 
inserted sequences allowing for small local duplications. Since the indel and 
substitution model used by indel-seq-gen are separate from (and richer than) 
those used by ProtPal, ProtPal has no unfair advantage in this test. 

indel-seq-gen v2.0.6 was run with the following command: 
cat guidetree . tree I indel-seq-gen -m JTT -u xia — nuin_gaiiima_cats 3 -a 
0.372 — branch_scale r/b — outfile simulated.alignment . f a — quiet — out file .format 
f -s 10000 — write_anc 

The above command uses the "JTT" substition model, the "xia" indel fill 
model (based on neighbor effects, estimated from E coli k-12 proteins 12 ), and 
3 gamma-distributed rate categories with shape 0.372. Branch lengths are scaled 
by the substitution rate for simulation rate r, normalized by the inverse of indel- 
seq-gen's underlying substitution rate {b = 1.2) so as to adhere to the above 
definition of evolutionary "time". Similarly, indel rates, which are set in the 
guide tree file guidetree .tree, are scaled by ^ so that tX* insertions/deletions 



Insertion-deletion phylogenetics 



34 



are expected over time t for rate A*. 

The root mean squared error (RMSE) for each error distribution was com- 
puted as follows: 

RMSE = 



\ 



A 

replicates 



The true tree was made available to all programs which can utilize a tree 
(ProtPal, PRANK, MUSCLE), representing the use case in which the true tree 
is known (e.g. via the species tree) but the true alignment is unknown. We ran 
simulations on three different phylogenies: a tree of twelve sequenced Drosophila 
genomes 28 and trees from the mammalian and amniotic clades of the OPTIC 
database. We here report results for the Drosophila tree, which we empirically 
observe to show trends consistent with, but more pronounced than, those of the 
mammalian and amniotic trees. The clearer trends may be due to the Drosophila 
tree being larger than the other trees (12 taxa), or having a diverse range of 
branch lengths (0.001 - 0.59 expected substitutions/site, at the genome- wide av- 
erage rate). The simulation data, reconstructions, and analysis scripts are avail- 



able from http : //biowiki . org/~oscar/simulation_reconstruct ion . tar 



Alignment We investigated several multiple alignment tools [I^ IS 23,56 
in combination with alignment-conditioned reconstruction methods. Programs 
were run with their default settings, with the exception of PRANK and MUS- 
CLE. To specify ancestral inference, the guide tree, and "insertions opening for- 
ever" , PRANK used the extra options "-write anc -t <treefile> +F".PRANK's 
-F option allows insertions to match characters at alignments closer to the root. 
This can be a useful heuristic safeguard when an incorrect tree may produce 
errors in subtree alignments that cannot be corrected at internal nodes closer 
to the root. Since the true guide tree is provided to PRANK, it is safe to treat 
insertions in a strict phylogenetic manner via the +F option. For computational 
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efficiency, ProtPal was provided with a CLUSTALW guide alignment. Any 
alignment of the sequences can be used as a guide, and we chose CLUSTALW 
for its general poor performance, so that ProtPal would gain no unfair advantage 
by the information contained in the guide alignment. MUSCLE was provided 
the guide tree with the additional option "-usetree <treef ile>". 

Muscle v3.6 

MUSCLE -in unaligned. fa -out aligned. fa -usetree guidetree . tree 
PRANK V.080820 

PRANK -d=unaligned. f a -noxml -realbranches -writeanc -o=output_directory 
-t=guidetree . tree +F 

Clustal v2.03 

clustalw -INFILE=unaligned.fa -OUTFILE=aligned. f a 

ProbCons vl.l2 

probcons unaligned. fa > aligned. fa 

FSA vl.08 

fsa unaligned. fa > aligned. fa 

MAFFT v6.818b 

mafft unaligned > aligned. fa 

muscle 3.6 -in <infile> -out <outfile> -usetree <guide tree> 
prank v. 

cluswal 2.03 $ (clustalw) -INFILE=$< -OUTFILE=$@ . clustalw 

probcons 1.12 probcons <infile> 

fsa 1.08 fsa <infile> 

mafft v6.818b mafft <infile> 
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Imputing indel histories The ancestral reconstruction programs ProtPal 
and PRANK were used to directly impute indel histories. The remaining tools 
were augmented to reconstruction tools by post-processing their MSAs using 
the maximum parsimony algorithm described in 55 , with the ambiguous cases 
described therein (e.g. where a column of characters could be equally parsimo- 
niously explained by a deletion on one child branch or an insertion on the other) 
resolved by a uniformly random choice from the possible solutions. Indel rates 
were estimated by counting indel events in MAP reconstructed histories: 

Ofy = argmax^, P(l9' I ^, 5', T) = argmax^,P(^, 6'|T, 6>0 (7) 

where the latter step assumes a flat prior, P{0^) = const. 

This statistic is not without its problems. For one thing, we use an initial 
guess of 6 to estimate H. Furthermore, for an unbiased estimate, we should 
sum over all histories, rather than conditioning on the MAP reconstructed his- 
tory. This summing over histories would, however, require multiple expensive 
calculations of P(6'|T, 6>), where conditioning on H requires only one such calcu- 
lation. We further justify our benchmark of parameter estimates conditioned on 
a MAP-reconstructed history by noting that this the de facto method employed 



by large-scale genomics studies focusing on indels 24 - 27 . 

As well as imputing indel rates from reconstructed histories, we also tried 
using the lambda. pi program in the DAWG package 22 , which estimates indel 
rates from MSAs directly (without attempting reconstruction). 

Estimating substitution rates Substitution rates were estimated for each 
inferred alignment using XRate's built-in EM algorithm and the following simple 
rate matrix. Given an equilibrium distribution over amino acid characters, with 
7Ti defining the proportion of character i, the rate of character i mutating to j 
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is set to riTj where r is the only free rate parameter. XRate's estimate of r is 
taken to be the average substitution rate of the MSA. 

By using indel-seq- gen's branch-scale option and changing the indel rate 
parameters accordingly, we are able to modulate the substitution and indel 
rates independently in the data generation step. This true substitution rate 
and the rate inferred by XRate are then directly comparable. 



B Supplemental Figures: simulated data analy- 



sis 
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C Supplemental Figures: OPTIC analysis 

In addition to estimating indel rates for all genes in the OPTIC set, we per- 
formed various other analyses which were left out of the main text for reasons 
of space limitations. We provide figures those displaying results here. 
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True alignment Inferred alignment 




Insertions nearby both in sequence and on the tree. Multiple substitutions inferred instead of multiple 

nearby indels. Insertions are underestimated, 
deletions are unaffected. 
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One sequence experiences nearby insertion and Multiple substitutions inferred instead of multiple 

deletion events. indels. Insertions and deletions are both 

underestimated. 




T 




Insertions are nearby in sequence, but distant on 
the tree. 



Key 

Insertion event 
9 Deletion event 



Three independent deletions are inferred instead of 
two insertions. Insertions are underestimated and 
deletions are overestimated. 



PRANK and ProtPal are effectively robust to 
situation C by treating insertions and deletions as 
piiyiogenetic events. 



Figure 3: Gap attraction, the canceling of nearby complementary indels, can 
affect insertion and deletion rates in various ways depending on the phylogenetic 
relationship of the sequences involved. All programs are, to some extent, sensi- 
tive to situations A and B whereas phylogenetic aligners can avoid situation C. 
An insertion at a leaf requires gaps at all other leaves - an understandably costly 
alignment move when gaps are added without regard to the phylogeny, result- 
ing in multiple penalization for each insertion. Such a penalization would 
cause most non-phylogenetic aligners to prefer the "Inferred alignment" in case 
C where there are fewer total gaps. Aligners treating indels as phylogenetic 
events would penalize each of the implied multiple deletions and only penalize 
each insertion once, thus preferring the "True alignment" in case C. 
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Insertion Rates - OPTIC Deletion Rates - OPTIC 
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Insertions per synonymous substitution Deletions per synonymous substitution 

Figure 4: Insertion and deletion rates in Amniota show similar distributions, 
with 95% of genes having rates less than approximately 0.1 indels per synony- 
mous substitution. Insertion and deletion rates were estimated using recon- 
structions done with ProtPal from a set of approximately 7,500 protein-coding 
genes from the OPTIC amniote database 30 . Indel rates were normalized by 
the synonymous substitution rate of each gene as computed with PAML 53 so 
that the plotted rate represents the number of expected indels per synonymous 
substitution. Since these rates are conditioned on the MAP reconstructed his- 
tory, there are many alignments whose inferred indel rates are zero (197, 174, 
and 54 for insertions, deletions, and both, respectively). 
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Deletion along branch 




Insertion along branch 




No insertion or deletion 




Figure 5: Each path through this state graph represents a possible evolutionary 
history relating sequences GL and GIV. By using stochastic traceback algo- 
rithms (sampling paths proportional to their posterior probability, blue high- 
lighted states and transitions), it is possible to select a high-probability subset 
of the full state graph. By constructing such a subset at each internal node, it 
is possible to maintain a bound on the state space size during progressive tree 
traversal while still retaining an ensemble of possible histories. 
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Figure 6: ProtPal correctly reconstructs the age of more extant residues than 
any other program tested. The y-ax.is shows the proportion of extant residues 
whose point of origin on the phylogenetic tree was correctly pinpointed by the 
reconstruction. The branch of origin was found by taking the tree node closest 
to the root containing a non-gap reconstructed character. All programs except 
FSA are in the 92%-94% range, owing to the fact that many columns (especially 
at low indel rates) are devoid of indels, making inference of origin trivial (as these 
columns' origin is pre-root). 
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Figure 7: Cross-comparison of AMA scores and rate estimation accuracy re- 
veals that using a single metric to assess alignment accuracy can be unreliable. 
AMA scores were computed for each programs alignment of only leaf sequences 
using cmpalign from the DART package 57 . AMA scores are comparable 



across programs until higher indel rates, where FSA performs best — contrasting 
with Figures 1 and 2 (main text). MUSCLE's accurate deletion rate measure- 
ments at high rates and the low corresponding AMA scores suggest a "cancel- 
lation of biases" . 



Insertion-deletion phylogenetics 



Insertion Rates 



44 



Substitution 
Rate 

■ 0.5 

■ 1.0 

■ 2.0 



"D 

QJ 



QJ 



Deletion Rates 



II 



0.005 



ii 




0.01 



ft) 



0.02 30 



lifiliiill 

illiill liiiiiill 



0.04 



0.08 



Program 

Figure 8: Most programs are relatively robust to variations in the simulated 
substitution rate, as evidenced by the benchmark data grouped according to 
substitution rate. Accuracy of rate estimation is plotted as \true — infer red\ on 
the ?/-axis, with bars grouped by program for each indel rate and 3-tuple of sub- 
stitution rates. Higher substitution rates often lead to higher error, presumably 
because they obscure homologies, making it more difficult to distinguish sub- 
stitutions from indels. FSA appears more sensitive to increased substitutions 
than other programs - at indel rate 0.02, FSA's insertion rates are as accu- 
rate as ProtPal's at 0.5 and 1.0 substitutions per site, whereas at the highest 
substitution rate (2.0), its error exceeds that of CLUSTALW. 
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Figure 9: Substitution rates estimated from multiple alignments display com- 
parable accuracy across methods. 
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Branch - specific Indel Rates - OPTIC 
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Figure 10: Reconstruction allows for estimation of branch-specific indel rates, 
revealing possibly interesting signals of evolution. Indel rates were averaged 
over all alignments, using the species tree shown in Figure 11 The human 



branch {Euarchontoglires - H. sapiens) appears to have experienced unusually 
many insertions. The Amniota - Australophenids (pink) branch has a higher 
deletion than insertion rate, though it is difficult to distinguish an insertion on 
this branch from a deletion on the Amniota - G.gallus (navy) branch. All other 
branches are comparable between insertions and deletions. Each bar is colored 



according to branches in Figure 11 
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80 MYA 

Euarchontoglires 




H. sapiens M. musculus C. familiaris M. domestica O. anatinus G. gallus 



Figure 11: The phylogenetic tree used for analysis of OPTIC data, colored to 



inform the branch-specific Figure 10 



Residue Distribution - OPTIC 




Figure 12: Distributions over amino acids are highly non-uniform, and differ 
between insertions, deletions, and the overall distribution seen in OPTIC. In- 
serted, deleted, and all sequences were separately pooled across all OPTIC genes 
reconstructed and amino acid distributions were computed for each. 
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Figure 13: Lengths of inserted and deleted sequences are similarly distributed, 
in contrast to the conclusions of previous studies, such as 32 , which found 
that deletions were longer relative to insertions in C. elegans sequence data. 
While this may represent a genuine difference in the evolution of human and 
worm genomes, it is likely that the use of deletion-biased aligners (MUSCLE 
and CLUSTALW) affected their conclusions. 
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Insertion Positions within Sequence - OPTIC 
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Figure 14: Indels are highly non-uniform in their distribution across genes: we 
see a 6-fold enrichment for insertions within the N-terminal 1% of the protein 
sequence, and a 1.4- fold increase within the C-terminal 1%. There is an 14-fold 
enrichment in deletions within the N-terminal 1% of the protein sequence, and 
a 1.8-fold increase within the C-terminal 1%. Indel locations are normalized by 
gene length to enable combining data across all OPTIC genes analyzed. This 
may be a mix of genuine biology (e.g. indels occur more often near the ends of 
genes) and artifacts (annotation errors are more likely to occur at the ends of 
genes). 
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Indel rate vs indel SNP 

occurrence - OPTIC Density 
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Figure 15: Visualizing the number of indel SNPs per residue (using only human 
sequence) against the evolutionary indel rate (computed across the Amniote 
clade) shows no correlation. 



